Multipattern regular expression search systems and methods therefor

ABSTRACT

This disclosure relates generally to a tool, system, and method for searching input data. The system may include a pattern input module configured to receive regular expression patterns of symbols. An interpreter module may be configured to access individual ones of the symbols of the input data and, upon accessing each symbol, compare at least one thread against the symbol. For each pattern, the at least one thread corresponding to the pattern is compared against the symbol prior to the at least one thread being compared against a subsequent symbol of the input data. An output module may be configured to output an indication of ones of the patterns determined to be contained within the input data based on the comparison of the corresponding at least one thread to the symbols of the input data.

PRIORITY

This application is a continuation of and claims priority to U.S. application Ser. No. 15/664,056, filed Jul. 31, 2017, which is a continuation of U.S. application Ser. No. 15/076,859, filed Mar. 22, 2016, now U.S. Pat. No. 9,720,647, which is a continuation-in-part of U.S. application Ser. No. 13/786,207, filed Mar. 5, 2013, now U.S. Pat. No. 9,229,026, which claims priority to U.S. Provisional Application No. 61/607,288, filed Mar. 6, 2012, the entire contents of each of which is incorporated herein by reference.

TECHNICAL FIELD

The disclosure herein relates generally to multipattern regular expression searches.

BACKGROUND

Regular expression search tools are well known in the art. A regular expression is a pattern that specifies any string of characters or other symbols that meet the terms of the pattern. Regular expressions incorporate a well known syntax, or “literals”, with operators that can be utilized to specify multiple strings that may match the pattern, also known as the “language” of the regular expression. For instance, the pattern “ab*c” would be met by strings “ac”, “abc”, “abbc”, etc. Such regular expression search tools can be used to search an input data for fragments of the input data that meet the regular expression pattern.
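As an illustration only, the “ab*c” example above may be reproduced with Python's standard re module, which is a conventional single-pattern tool and not the search tool disclosed herein. The function name fragments and the sample input string are assumptions made for this sketch.

```python
import re

# "ab*c": an "a", then zero or more "b" symbols, then a "c".
PATTERN = re.compile(r"ab*c")


def fragments(data: str) -> list[str]:
    """Return each non-overlapping fragment of data that meets the pattern."""
    return [match.group(0) for match in PATTERN.finditer(data)]


print(fragments("xx ac yy abc zz abbc ab"))  # ['ac', 'abc', 'abbc']
```

The trailing “ab” is not reported because it lacks the closing “c” required by the pattern.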

DRAWINGS

FIG. 1 is a block diagram of an example search tool.

FIG. 2 is a block diagram of a system that may include or implement the search tool of FIG. 1.

FIG. 3 is an illustration of a finite automaton.

FIG. 4 is a nondeterministic finite automaton (NFA) for an exemplary sequence.

FIG. 5 is an illustration of a code point-code point and byte-byte transformation chain.

FIG. 6 is a flowchart for searching input data including symbols.

FIG. 7 is a block diagram illustrating components of a machine.

FIG. 8 is a table of multiple encodings.

DESCRIPTION

The following description and the drawings sufficiently illustrate specific embodiments to enable those skilled in the art to practice them. Other embodiments may incorporate structural, logical, electrical, process, and other changes. Portions and features of some embodiments may be included in, or substituted for, those of other embodiments. Embodiments set forth in the claims encompass all available equivalents of those claims.

While digital forensic investigations may involve searching for hundreds or thousands or more of keywords and patterns, certain regular expression search tools focus on searching line-oriented text files with a single regular expression. Certain tools may search for one pattern at a time, with the input data being searched completely through once for every pattern. Alternatively, patterns may be joined in a single search, but individual patterns are prioritized over other patterns. Thus, to the extent that the search tool identifies or begins to identify a match for a first regular expression in the text file, the search tool may not identify a second regular expression that overlaps the first regular expression in the text file. As a result, while a single search may incorporate multiple regular expressions, such search tools may be sensitive to only one expression at a time.
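The limitation described above may be observed, as an illustration only, with Python's standard re module standing in for a conventional search tool. The patterns “abc” and “bcd”, the input “abcd”, and the function names are assumptions chosen for this sketch.

```python
import re

DATA = "abcd"


def joined_search(patterns, data):
    """Search once with the patterns joined by alternation."""
    combined = re.compile("|".join(patterns))
    return [(m.start(), m.end(), m.group(0)) for m in combined.finditer(data)]


def separate_searches(patterns, data):
    """Search the whole input once per pattern."""
    hits = []
    for pattern in patterns:
        hits += [(m.start(), m.end(), m.group(0)) for m in re.finditer(pattern, data)]
    return hits


print(joined_search(["abc", "bcd"], DATA))      # [(0, 3, 'abc')]
print(separate_searches(["abc", "bcd"], DATA))  # [(0, 3, 'abc'), (1, 4, 'bcd')]
```

The joined search misses “bcd” because it overlaps the match already found for “abc”, while the separate searches find both matches at the cost of one pass over the input per pattern.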

A search tool has been developed for digital forensic investigations that provides multipattern searches with matches labeled by pattern, scales relatively efficiently to increased numbers of patterns, supports large binary streams and long matches, and supports multiple encodings for a single pattern, such as UTF-8, UTF-16 and legacy code pages. A multipattern engine of the search tool may identify all the occurrences of patterns in a byte stream, even if some matches overlap. The patterns may have full use of the regular expression syntax, and may not be limited to fixed strings.

In a conventional search tool, as more patterns are added to a search, the time to conduct the search may increase generally linearly with the number of patterns. Thus, all other variables being the same, doubling the number of search terms may approximately double the time to conduct the search. The search tool disclosed herein may not require linear increases in search time to search for more patterns, so that it may be faster to search for all the patterns in a single pass of the data than to perform multiple search passes for individual or groups of patterns.

The search tool may further search byte streams larger, and in certain examples many times larger, than available system memory and may track pattern matches that may be hundreds of megabytes or more long. Further, because digital forensic data may tend to be unstructured, it may be necessary to search for occurrences of the same patterns in different encodings. Concurrent searching for multiple encodings has both general relevance, as text is often encoded according to various code sets, and relevance in particular circumstances, such as when searching for text in foreign languages, where numerous encodings exist.

Search Tool

FIG. 1 is a block diagram of an example search tool 100. As illustrated, the search tool 100 includes a processor 102 configured to execute an automaton against input data to be searched as disclosed in detail herein. The processor 102 may be one or more processors, microprocessors, controllers, or other programmable devices, may be or include a single- or multiple-core microprocessor, and may be distributed among multiple platforms as known in the art. In such examples, the processor 102 may include a controller for controlling distributed processing among multiple individual processors.

As illustrated, the search tool 100 further includes memory 104. The memory 104 may include various volatile and non-volatile electronic memory technologies known in the art, including various types of random access memory (RAM) and read-only memory (ROM). The memory 104 may be understood to include cache memory of the processor 102 as well, and/or electronic data storage, such as a hard drive.

The search tool 100 further includes a user interface 106 configured to output an indication of a result of the search tool 100, among other input and output functions. The user interface 106 variously includes a visual display, an input device, such as a keyboard, a mouse, trackball or other related device, a touchscreen, a printer, and, in various examples, an electronic data output, such as may output an electronic file for accessing by an electronic device.

The search tool 100 further optionally includes one or more of a network interface 108 and a data port 110. The network interface 108 may be a wired or wireless network interface as well known in the art. The network interface 108 may be utilized as part of the user interface 106, such as may be utilized to transmit data to and receive commands from a remote user interface 106. The network interface 108 may further be utilized to obtain input data for analysis by the search tool 100 and output results from the search tool 100. The data port 110 may be any port or mechanical interface that may interface with a data storage device, including, but not limited to, a connector for a hard drive, a disk drive, a port, such as a USB port or other port that may interface with a portable storage device, or a socket or electrical contact to which a chip including input data may be connected or hard-wired so as to obtain data contained thereon.

In various examples, the input data is the complete or essentially complete information contained on or by a particular data storage device, such as a hard drive. In an example, the input data may include both data that would conventionally be accessible by a user of the hard drive, such as files and other data deliberately stored by a user of the hard drive, as well as file systems and various metadata of the hard drive. Consequently, the input data may include various types and configurations of data. In various further examples, the input data may be obtained from alternative sources, such as wireless data that may have been received by an intended destination or intercepted.

The search tool 100 may be or include dedicated and/or custom hardware and software configured to conduct searches of input data as disclosed in detail herein. The search tool 100 may be a proprietary configuration of commercially available hardware. Alternatively or additionally, the search tool 100 may be implemented on commercially available hardware systems, such as personal computers, work stations, servers, and combinations thereof.

System

FIG. 2 is a block diagram of a system 200 that may include or implement the search tool 100. The system 200 is drawn to refer to those components and systems involved in performing specific aspects of searching operations as modules. While the modules are drawn with specificity, it is to be understood that, for a given search tool 100 and system 200, specific elements may perform tasks or operations relevant to the various search operations and thus, dependent upon certain circumstances, may be understood as variously corresponding to or being assigned to particular modules, such as on a permanent, temporary, or ad hoc basis. The modules may include hardware, such as corresponds to the search tool 100, as well as software that implements various aspects of the search tool 100 and coordinates among various hardware components of the search tool 100.

A pattern input module 202 may include the memory 104, the user interface 106, the network interface 108 and/or the data port 110. The pattern input module 202 may receive patterns that a user of the search tool 100 and the system 200 may wish to determine are or are not present in input data, such as may be obtained from the network interface 108 and/or the data port 110. The patterns may be or include regular expression patterns of symbols. The pattern input module 202 may store the various regular expression patterns in the memory 104.

A data input module 204 may include the user interface 106, the network interface 108, and/or the data port 110. The data input module 204 may obtain the input data. The input data may be obtained serially or in a block. The input data may be provided from the data input module 204 to other modules and/or the processor 102 and memory 104. The provision of the input data from the data input module 204 to other components of the search tool 100 and the system 200 may be serial and may, in various examples, be effectively concurrent with the obtaining of individual symbols of the input data in the first instance.

An interpreter module 206 may include the processor 102 and the memory 104. The interpreter module 206 may access individual ones of the symbols, such as upon each symbol of the input data being obtained serially. The interpreter module 206 may, upon accessing each symbol, compare at least one thread against the symbol as disclosed herein. The at least one thread may be based on at least one of the patterns as obtained by the pattern input module 202, and each pattern may correspond to at least one of the threads. For each pattern, the at least one thread corresponding to the pattern may be compared against the symbol prior to the at least one thread being compared against a subsequent symbol of the input data, as disclosed in detail herein.

A pattern analyzer 208 may include the processor 102 and the memory 104. The pattern analyzer 208 may generate an automaton as disclosed herein. A coding module 210 may include the processor 102 and the memory 104 and may generate a pattern according to a pattern as input by the pattern input module 202 and various encodings, as disclosed herein.

An output module 212 may include the memory 104, the user interface 106, the network interface 108, and a data port 110. The output module 212 may be configured to output an indication of a match of certain patterns in the input data. The output may be relatively simple, such as a message that at least one pattern was found in the input data, or may be relatively complex, such as which patterns were found in the input data, where the patterns were found, the context of the patterns within the input data, and so forth.

Finite Automata

FIG. 3 is an illustration of a finite automaton 300. In various examples, the finite automaton 300 may consist of a set of states 302, one of which is the initial state 302A, and some of which may be terminal states 302B. Pairs of states 302 may have one or more transitions 304 from one state 302 to the other. Each transition 304 may correspond to a symbol in the input data. A symbol of the input data may be a single bit, a single byte, a combination of bytes that corresponds to a particular character, or other discrete and/or definable collection of data that may correspond to identifiable information.

A finite automaton 300 may be generated by the pattern analyzer 208 and may be implemented by the interpreter module 206. The finite automaton 300 may read characters from the input data. The interpreter module 206 may step through an automaton 300 as part of a search of the input data. Stepping through the automaton may generate a current state 302 of the finite automaton 300, which may change as the interpreter module 206 follows transitions 304 with labels that match the symbols read from the input data. If a terminal state 302B is reached, the finite automaton 300 has matched a pattern corresponding to the automaton 300 with the input data. If a non-terminal state 302 is reached that has no transition 304 for the current symbol, the finite automaton 300 has not matched the pattern with the input. A finite automaton 300 may be defined as a deterministic finite automaton (DFA) if no state 302 has two or more outgoing transitions 304 with the same label; otherwise, the automaton 300 may be a nondeterministic finite automaton (NFA). In various examples, every NFA has an equivalent regular expression, and vice versa.
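Stepping through a finite automaton as described above may be sketched as follows. The sketch is illustrative only: it hand-codes a small DFA for the pattern “ab*c” from the background, and the state numbering, the TRANSITIONS and TERMINAL names, and the function matches_at are assumptions rather than structures taken from the pattern analyzer 208.

```python
# Hand-built DFA for "ab*c": state 0 is initial, state 2 is terminal.
TRANSITIONS = {
    (0, "a"): 1,
    (1, "b"): 1,
    (1, "c"): 2,
}
TERMINAL = {2}


def matches_at(data, start):
    """Step through the DFA from offset start.

    Returns the exclusive end offset of the match, or None if a state is
    reached that has no transition for the current symbol.
    """
    state = 0
    for offset in range(start, len(data)):
        next_state = TRANSITIONS.get((state, data[offset]))
        if next_state is None:
            return None  # no transition for this symbol: no match
        state = next_state
        if state in TERMINAL:
            return offset + 1
    return None  # input ended before a terminal state was reached


print(matches_at("qabbc", 1))  # 5
print(matches_at("qabbc", 0))  # None
```

Because no state in this table has two outgoing transitions with the same label, the automaton is deterministic and a single current state suffices.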

Multipattern searching may apply tagged transitions 304 to the pattern matches, in contrast to applying tagged transitions to submatches. For instance, instead of using an array of submatch positions in the automaton 300, each state 302 may have one or more scalar values for the starting offset of the match, ending offset and value of the last tagged transition 304. Transitions 304 may be tagged to match states 302 with the corresponding index numbers of the patterns. In an example, the baseline complexity of an NFA search using the search tool 100 may be O(nm), where n is the number of patterns and m is the length of the input data. In various examples, several practical optimizations as disclosed herein may be incorporated in the search tool 100 to improve performance over the baseline complexity by utilizing relatively large automata 300.

In an example, instead of using an NFA directly, the search tool 100, such as the interpreter module 206, may compile patterns into a command sequence using commands such as: literal c; fork n; jump n; match n; and halt. literal c may increment an instruction and suspend a current thread if the current symbol is c, or, otherwise, terminate the current thread. fork n may create a new thread at instruction n at the current offset and increment the instruction. jump n may go to instruction n. match n may record a match for pattern n ending at the current offset and increment the instruction. halt may terminate the current thread and report a match, such as may be output by the output module 212, if a match exists. Given a list of patterns to match from the pattern input module 202 and a stream of input from the data input module 204, a thread of the search tool 100 may then be executed by an interpreter module 206 to produce a list of matches.

In an example, each thread is a tuple (s, i, j, k) where s is the current instruction, i is the start (inclusive) of a matched pattern with the input data, j is the end (exclusive) of the match, and k is the index of the matched pattern. In an example, when a thread is created, it is initialized to (0, p, ∅, ∅) where p is the current position in the input data. A zero (0) for the start or end of a match indicates that a match starts or ends at offset 0; a null (∅) indicates no match.

Example Implementation

FIG. 4 is a nondeterministic finite automaton (NFA) 400 for an exemplary sequence utilizing an example of the search tool 100. An exemplary input data is qabcqabdbd. Exemplary search patterns are “a(bd)+” and “abc”. The command sequence or “bytecode” corresponding to the automaton 400 for such search patterns is:

0 literal “a”
1 fork 6
2 literal “b”
3 literal “d”
4 match 0
5 jump 2
6 literal “b”
7 literal “c”
8 match 1
9 halt

The commands corresponding to the automaton 400 are executed as the symbol considered in the input data is sequentially advanced one symbol at a time by the interpreter module 206. The leftmost column lists the thread ID, the second column specifies the thread and the third column provides an explanation of the step.

Thread  (s, i, j, k)      Explanation

1: qabcqabdbd
0       (0, 0, ∅, ∅)      thread 0 created
0       (0, 0, ∅, ∅)      literal “a” fails; thread terminates

2: qabcqabdbd
1       (0, 1, ∅, ∅)      thread 1 created
1       (0, 1, ∅, ∅)      literal “a” succeeds
1       (1, 1, ∅, ∅)      advance instruction and suspend

3: qabcqabdbd
2       (0, 2, ∅, ∅)      thread 2 created
2       (0, 2, ∅, ∅)      literal “a” fails; thread terminates
1       (1, 1, ∅, ∅)      fork 6 creates thread 3
3       (6, 1, ∅, ∅)      thread 3 created
1       (2, 1, ∅, ∅)      advance instruction
1       (2, 1, ∅, ∅)      literal “b” succeeds
1       (3, 1, ∅, ∅)      advance instruction and suspend
3       (6, 1, ∅, ∅)      literal “b” succeeds
3       (7, 1, ∅, ∅)      advance instruction and suspend

4: qabcqabdbd
4       (0, 3, ∅, ∅)      thread 4 created
4       (0, 3, ∅, ∅)      literal “a” fails; thread terminates
1       (3, 1, ∅, ∅)      literal “d” fails; thread terminates
3       (7, 1, ∅, ∅)      literal “c” succeeds
3       (8, 1, ∅, ∅)      advance instruction and suspend

5: qabcqabdbd
5       (0, 4, ∅, ∅)      thread 5 created
5       (0, 4, ∅, ∅)      literal “a” fails; thread terminates
3       (8, 1, ∅, ∅)      match 1
3       (8, 1, 4, 1)      set match pattern and end offset
3       (9, 1, 4, 1)      advance instruction
3       (9, 1, 4, 1)      halt; reports match on pattern 1 at [1,4), terminates

6: qabcqabdbd
6       (0, 5, ∅, ∅)      thread 6 created
6       (0, 5, ∅, ∅)      literal “a” succeeds
6       (1, 5, ∅, ∅)      advance instruction and suspend

For simplicity, from here on the creation of threads that immediately terminate because of the failure to match the current symbol is not specifically addressed, though one of ordinary skill in the art will recognize that such threads are created based on the commands corresponding to the automaton 400 above and the example of the preceding symbols of the input data.

7: qabcqabdbd
6       (1, 5, ∅, ∅)      fork 6 creates thread 7
7       (6, 5, ∅, ∅)      thread 7 created
6       (2, 5, ∅, ∅)      advance instruction
6       (2, 5, ∅, ∅)      literal “b” succeeds
6       (3, 5, ∅, ∅)      advance instruction and suspend
7       (6, 5, ∅, ∅)      literal “b” succeeds
7       (7, 5, ∅, ∅)      advance instruction and suspend

8: qabcqabdbd
6       (3, 5, ∅, ∅)      literal “d” succeeds
6       (4, 5, ∅, ∅)      advance instruction and suspend
7       (7, 5, ∅, ∅)      literal “c” fails; thread terminates

9: qabcqabdbd
6       (4, 5, ∅, ∅)      match 0
6       (4, 5, 8, 0)      set match pattern and end offset
6       (5, 5, 8, 0)      advance instruction
6       (5, 5, 8, 0)      jump 2
6       (2, 5, 8, 0)      goto instruction 2
6       (2, 5, 8, 0)      literal “b” succeeds
6       (3, 5, 8, 0)      advance instruction and suspend

10: qabcqabdbd
6       (3, 5, 8, 0)      literal “d” succeeds
6       (4, 5, 8, 0)      advance instruction and suspend

11: Having reached the end of the input data, the remaining threads run until they terminate:
6       (4, 5, 8, 0)      match 0
6       (4, 5, 10, 0)     set match pattern and end offset
6       (5, 5, 10, 0)     advance instruction
6       (5, 5, 10, 0)     jump 2
6       (2, 5, 10, 0)     goto instruction 2
6       (2, 5, 10, 0)     literal “b” fails; reports match of pattern 0 at [5,10); thread terminates

The execution of these commands by the interpreter module 206 corresponding to the automaton 400 reports a match for abc at [1, 4) and a match for a(bd)+ at [5, 10). As is illustrated, the single automaton 400 may thereby be utilized, such as by the interpreter module 206, to search for two different patterns in the input data simultaneously. As illustrated, each symbol of the input data is tested only once. Furthermore, it is to be understood that the principles disclosed and/or illustrated herein are scalable, and that an automaton 300 may be constructed based on three or more patterns. As such, the number of patterns that may be searched while testing each symbol of input data only once may be limited only by the resources of the search tool 100 or the system 200, such as by the amount of available memory 104.
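The execution model of this example may be sketched in Python as follows. The sketch is illustrative only and is not the interpreter module 206: the tuple encoding of the commands, the function name search, the use of None for the null value, and the order in which threads are scheduled within a step are all assumptions. The command semantics follow the descriptions of literal, fork, jump, match and halt given above, with a thread that fails a literal reporting any match it has already recorded.

```python
from collections import deque

# Bytecode for the patterns a(bd)+ (pattern 0) and abc (pattern 1).
PROGRAM = [
    ("literal", "a"),  # 0
    ("fork", 6),       # 1
    ("literal", "b"),  # 2
    ("literal", "d"),  # 3
    ("match", 0),      # 4
    ("jump", 2),       # 5
    ("literal", "b"),  # 6
    ("literal", "c"),  # 7
    ("match", 1),      # 8
    ("halt", None),    # 9
]


def search(program, data):
    """Run every thread against each symbol once; return (i, j, k) matches."""
    matches = []
    suspended = []  # threads (s, i, j, k) waiting for the next symbol
    for offset in range(len(data) + 1):
        # One extra step past the end lets the remaining threads terminate.
        symbol = data[offset] if offset < len(data) else None
        runnable = deque(suspended)
        suspended = []
        if symbol is not None:
            runnable.append((0, offset, None, None))  # new thread (0, p, null, null)
        while runnable:
            s, i, j, k = runnable.popleft()
            while True:
                op, arg = program[s]
                if op == "literal":
                    if symbol == arg:
                        suspended.append((s + 1, i, j, k))  # advance and suspend
                    elif j is not None:
                        matches.append((i, j, k))  # terminate, reporting the match
                    break
                if op == "fork":
                    runnable.append((arg, i, j, k))
                    s += 1
                elif op == "jump":
                    s = arg
                elif op == "match":
                    j, k = offset, arg
                    s += 1
                else:  # halt
                    if j is not None:
                        matches.append((i, j, k))
                    break
    return matches


print(search(PROGRAM, "qabcqabdbd"))  # [(1, 4, 1), (5, 10, 0)]
```

Run on the ten-symbol string qabcqabdbd, the sketch reports abc at [1, 4) and a(bd)+ at [5, 10); run on the nine-symbol string qabcabdbd, it reports [1, 4) and [4, 9). In both cases each symbol of the input is read once, and all live threads are compared against it before the next symbol is read.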

It is emphasized that while a testing of each symbol only once may, under various circumstances, reduce the amount of time and resources consumed in comparison with testing some or all symbols more than once, the search tool 100 is not necessarily limited to testing each symbol only once. A search tool that otherwise operates according to the present disclosure that happens to test certain symbols more than once may still generally meet the terms of the present disclosure.

Applying an automaton 300 to input data as disclosed and illustrated herein may support the analysis of input data that is obtained serially. For instance, conventional electronic data storage devices, such as a hard drive, may automatically stream stored data serially. In an example, as each byte is streamed from the electronic data storage device, each byte may be tested upon being obtained by the data input module 204 or upon all of the bytes of a particular character being obtained by the data input module 204. In various examples, input data may be tested against the automaton 300 essentially as quickly as the input data is accessed from a data source. It is to be understood that, even though some or all of the input data may be obtained by the data input module 204 as a block, with any one or more of the symbols being accessible in any order, individual symbols may nevertheless be accessed by the interpreter module 206 from the block of input data and, upon being accessed, tested against the automaton 300.

Thread Creation

As illustrated above, the search tool 100 generally and the interpreter module 206 specifically may minimize thread creation, such as from unnecessary alternation. In various examples, rather than treating each pattern as a separate branch of the automaton 400, at least some patterns may be merged into the automaton 400 as the patterns are parsed to form a trie. A trie, also known as a prefix tree, is a tree whose root corresponds to the empty string, and every other node extends the string of its parent by one symbol. A trie may be a type of acyclic deterministic finite automaton (DFA). The merging may take into account not only the criteria of the transitions 304, but also the sets of source and target states 302B. In an example, a Glushkov nondeterministic finite automaton (NFA) form is utilized by the search tool. See, e.g., Glushkov, “The abstract theory of automata,” Russian Mathematical Surveys, volume 16(5) (1961), pages 1-53, incorporated herein in its entirety.
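The effect of merging patterns into a trie may be sketched for fixed-string patterns as follows. The sketch is illustrative only: the nested-dictionary node layout, the function names build_trie and count_nodes, and the sample patterns are assumptions, and the merging of full regular expressions into a Glushkov NFA is not reproduced.

```python
def build_trie(patterns):
    """Merge fixed-string patterns into a prefix tree.

    Each node records its child nodes by symbol and the indexes of the
    patterns that end at that node.
    """
    root = {"children": {}, "matches": []}
    for index, pattern in enumerate(patterns):
        node = root
        for symbol in pattern:
            node = node["children"].setdefault(
                symbol, {"children": {}, "matches": []}
            )
        node["matches"].append(index)
    return root


def count_nodes(node):
    """Count the nodes of the trie, including the root."""
    return 1 + sum(count_nodes(child) for child in node["children"].values())


trie = build_trie(["abc", "abd", "ax"])
print(sorted(trie["children"]))  # ['a']
print(count_nodes(trie))         # 6
```

With each pattern kept as a separate branch, the root would have three outbound transitions on “a” and the structure would need nine nodes; the merged trie has a single transition on “a” and six nodes, so fewer threads need to be created at the root.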

Jump Tables

In various examples, one thread is forked to handle each successor of a given state 302. Some NFA 400 states 302 may have a large number of successor states 302, making the creation of new threads costly in terms of time and computing resource consumption. For example, the first state 302 may have a relatively large number k of outbound transitions 304 when many patterns are specified. Therefore, every symbol read from the input stream causes k new threads to be created, almost all of which may terminate immediately due to the lack of a match.

Various examples of the search tool 100 determine the threads that will not terminate prior to reaching at least one subsequent state 302 and spawn only these threads. Such a determination may be made by the interpreter module 206. The interpreter module 206 may produce a jump table, such as a jumptable instruction. In such examples, the jumptable instruction sits at the head of a list of, for instance, two hundred fifty-six (256) consecutive instructions, or one instruction for each possible value of a current byte. When the jumptable instruction is reached with byte b, execution jumps ahead b+1 instructions and continues from there. The instruction offset b+1 from the jumptable may be a jump in the case of a match (in order to get out of the jump table); otherwise, it may be a halt. If more than one transition is possible for byte b, then a list of appropriate fork and jump instructions may be appended to the table and the jump instruction for byte b targets this table. Consequently, in such examples, only the threads that succeed are spawned. The interpreter module 206 may specify jumps to states just beyond their literal instructions, such as to prevent b from being evaluated twice.
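The dispatch behavior of a jump table may be sketched as follows. The sketch is a simplification and an assumption: it models the 256 entries as a Python list of target lists rather than as the consecutive jump, halt and fork instructions described above, and the function names, the sample transitions and the instruction addresses are hypothetical.

```python
def build_jump_table(first_transitions):
    """Build a 256-entry table from (byte_value, target_instruction) pairs.

    An empty entry plays the role of a halt; an entry with several targets
    plays the role of the appended fork and jump list.
    """
    table = [[] for _ in range(256)]
    for byte_value, target in first_transitions:
        table[byte_value].append(target)
    return table


def threads_to_spawn(table, byte_value):
    """Return the instructions at which threads continue for this byte."""
    return table[byte_value]


# Hypothetical first transitions: two patterns start with "a", one with "p".
TABLE = build_jump_table([(ord("a"), 1), (ord("a"), 12), (ord("p"), 7)])
print(threads_to_spawn(TABLE, ord("a")))  # [1, 12]
print(threads_to_spawn(TABLE, ord("q")))  # []
```

Without the table, three threads would be created for every input byte; with it, a byte such as “q” creates none, and “a” creates only the two threads that survive their first comparison.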

In various examples, a sibling instruction, such as jumptablerange, may be used when the difference between the minimum and maximum accepted byte values is small, such as from zero (0) to two hundred fifty-six (256) bytes. The sibling instruction may operate by checking that the byte value is in range and only then indexing into the table, for instance, to reduce the table size. In various examples, the range is produced by the coding module 210 and utilized by the interpreter module 206.
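A range-checked table of this kind may be sketched as follows. As with any simplified model, the representation is an assumption: the table is a Python list covering only the accepted byte range, and the function names and sample targets are hypothetical.

```python
def build_range_table(targets_by_byte):
    """Build a table covering only the minimum..maximum accepted byte values."""
    low, high = min(targets_by_byte), max(targets_by_byte)
    table = [targets_by_byte.get(value, []) for value in range(low, high + 1)]
    return low, high, table


def lookup(range_table, byte_value):
    """Check that the byte is in range, and only then index into the table."""
    low, high, table = range_table
    if byte_value < low or byte_value > high:
        return []  # out of range: no thread continues
    return table[byte_value - low]


RANGE_TABLE = build_range_table({ord("a"): [1], ord("c"): [7]})
print(len(RANGE_TABLE[2]))            # 3
print(lookup(RANGE_TABLE, ord("c")))  # [7]
print(lookup(RANGE_TABLE, ord("z")))  # []
```

Here three entries replace the two hundred fifty-six entries of a full table, at the cost of one range comparison per byte.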

State Synchronization

A typical simulation of an automaton 300 may utilize a bit vector (such as containing a bit for each state 302) to track the states 302 that are visited for the current symbol in the input data stream in order to avoid duplicating work. In such a simulation, the number of automaton 300 states 302 may depend on the combined length of the search patterns that are used. Therefore, a search that uses a large number of patterns (even fixed-string patterns) may result in the bit vector being relatively long. In various circumstances, the bit vector is cleared after each new symbol of the input stream is obtained.

However, it may be possible to determine that two threads cannot arrive at the same state 302. In various examples, it may be impossible for two threads to arrive at the same state 302 at the same input data symbol position unless the state 302 has multiple transitions 304 leading to it. Therefore, in an example, only the states 302 with multiple predecessor states 302 may utilize bits in the current state vector. In such an example, bits for other states may be omitted.

Consequently, some or all of the states 302 that are susceptible to having multiple threads arrive concurrently may be identified. Based on the identification of such states 302, the susceptible states 302 may include bits in the current state vector while other states may omit such bits. As a result, resources and time may be saved in comparison with including the bits in the current state vector for every state 302.

Various examples of the search tool 100 may associate an index with each state 302 having multiple incoming transitions 304, such as by using a chkhalt instruction. Such an instruction is inserted before outbound transition 304 instructions associated with a state 302 that may utilize synchronization. The index associated with the state 302 may be specified as an operand to chkhalt, which may use the index to test the corresponding bit in a bit vector. In an example, the bit is set if it is currently unset, and execution proceeds. In such an example, if the bit is already set, then the thread terminates. Consequently, the size of the bit vector may be reduced or otherwise minimized and safe transitions 304, which may occur frequently in practice, may be left unguarded.
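The synchronization scheme above may be sketched as follows. The sketch is illustrative only: the (source, target) representation of transitions, the function guarded_states, the class SeenBits, and the use of a Python integer as the bit vector are assumptions; only the test-and-set behavior of chkhalt follows the description above.

```python
from collections import Counter


def guarded_states(transitions):
    """Assign a bit index to each state with more than one incoming transition.

    transitions is a list of (source_state, target_state) pairs.
    """
    incoming = Counter(target for _, target in transitions)
    multi = sorted(state for state, count in incoming.items() if count > 1)
    return {state: index for index, state in enumerate(multi)}


class SeenBits:
    """Bit vector holding one bit per guarded state."""

    def __init__(self):
        self.bits = 0

    def chkhalt(self, index):
        """Return True if the thread may proceed, False if it must terminate."""
        mask = 1 << index
        if self.bits & mask:
            return False  # another thread already reached this state
        self.bits |= mask
        return True

    def clear(self):
        """Reset all bits, as is done when a new symbol is obtained."""
        self.bits = 0


# Only state 3 has two incoming transitions, so only it needs a bit.
GUARDED = guarded_states([(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)])
print(GUARDED)  # {3: 0}
```

In this sketch, a five-state automaton needs a one-bit vector rather than a five-bit vector, and threads entering states 1, 2 and 4 execute no check at all.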

Complex Instruction Set

The search tool 100 generally, and the interpreter module 206 specifically, may introduce new instructions to handle common cases. For example, an instruction may have two operands and continue execution if the current byte matches either operand. Similarly, an instruction range may have two operands and continue if the current byte has a value that falls within their range inclusively. More complex symbol classes may be handled with a bitvector instruction, such as an instruction followed by two hundred fifty-six (256) bits, where each bit is set to one if the corresponding byte value is permitted. If several states 302 have the same source and target states 302, their transitions 304 can be collapsed into a single bitvector instruction. In various examples, a new instruction may preferably be introduced if the new instruction can eliminate sources of alternation.
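The tests performed by these instructions may be sketched as follows. The function names either, in_range, make_bitvector and bitvector are hypothetical names chosen for this sketch, and a Python integer stands in for the 256 bits that follow a bitvector instruction.

```python
def either(byte_value, first, second):
    """Two-operand instruction: continue if the byte matches either operand."""
    return byte_value == first or byte_value == second


def in_range(byte_value, low, high):
    """Range instruction: continue if low <= byte <= high (inclusive)."""
    return low <= byte_value <= high


def make_bitvector(allowed):
    """Build a 256-bit mask with one bit set per permitted byte value."""
    bits = 0
    for value in allowed:
        bits |= 1 << value
    return bits


def bitvector(byte_value, bits):
    """Bitvector instruction: continue if the bit for this byte is set."""
    return (bits >> byte_value) & 1 == 1


HEX_DIGITS = make_bitvector(b"0123456789abcdef")
print(either(ord("x"), ord("x"), ord("X")))   # True
print(in_range(ord("5"), ord("0"), ord("9")))  # True
print(bitvector(ord("c"), HEX_DIGITS))         # True
print(bitvector(ord("g"), HEX_DIGITS))         # False
```

A symbol class such as the sixteen lowercase hexadecimal digits would otherwise require sixteen alternative literal transitions, and therefore forks; the single bitvector test replaces them.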

Compilation

Various examples of the search tool 100 may use a hybrid breadth-first/depth-first search scheme for laying out generated instructions. In an example, instructions for states 302 may first be laid out in breadth-first order of discovery; the discovery may switch to a depth-first search when a parent state 302 has a single transition 304. In various examples, advantageously, subsequent states 302 may generally be close to their parent states 302 due to breadth-first discovery. Further, the total number of instructions used may be reduced, in certain circumstances significantly, in linear sequences of states because jump and fork instructions may not be used between them.

Greedy Vs. Non-Greedy Matching

In various circumstances, a thread searching with a pattern such as <html>.*</html> may not identify individual fragments from, for instance, unallocated space in a file system of a hard drive. In an example, though the pattern may match the first fragment, a thread may continue the match attempts, eventually producing a match on the ends of subsequent fragments (if they exist) and reporting one long match.

A repetition operator that results in the shortest possible matches, such as .*? in <html>.*?</html>, may be referred to as a “non-greedy” operator. By executing threads spawned by a fork command before the threads' associated parent threads, it may be possible to control the priority given to an alternation. In contrast to the greedy pattern in the above example, <html>.*?</html> may generate one match for each fragment.
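The difference between the two operators may be observed, as an illustration only, with Python's standard re module, which supports the same greedy and non-greedy syntax but is not the search tool disclosed herein. The sample input and the function name find_all are assumptions made for this sketch.

```python
import re

DATA = "<html>one</html> junk <html>two</html>"


def find_all(pattern, data):
    """Return the text of each match of pattern in data."""
    return [match.group(0) for match in re.finditer(pattern, data)]


# Greedy: one long match spanning both fragments.
print(find_all(r"<html>.*</html>", DATA))
# Non-greedy: one match per fragment.
print(find_all(r"<html>.*?</html>", DATA))
```

The greedy pattern reports the entire input as a single match, while the non-greedy pattern reports the fragments “<html>one</html>” and “<html>two</html>” separately.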

Positional Assertions

Patterns may include positional assertions. For example, a pattern may assert that it must match on a certain line and in a certain column of certain input data, such as a text file. A file format may have an optional record that may be identified with a pattern, but that is known to occur only at a given offset. Further, searching functions may be limited to data that is sector-aligned. In various examples, the search tool may utilize syntax such as (?i%j@regex), where i is either an absolute or modulo byte offset and j is a divisor. In such examples, (?0%512@)PK may match sector-aligned .zip file headers.
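The effect of a modulo positional assertion may be sketched as follows. The sketch does not parse the assertion syntax above; it assumes that the offset i and divisor j have already been extracted, and it uses Python's standard re module on bytes. The function name find_aligned and the sample data are hypothetical.

```python
import re


def find_aligned(regex, data, offset, divisor):
    """Report matches of regex that start where position % divisor == offset."""
    compiled = re.compile(regex)
    hits = []
    for position in range(offset, len(data), divisor):
        match = compiled.match(data, position)  # anchored at this position only
        if match:
            hits.append(match.span())
    return hits


# Three 512-byte sectors: "PK" at the start of the second sector and,
# unaligned, in the middle of it.
sectors = bytearray(1536)
sectors[512:516] = b"PK\x03\x04"
sectors[700:702] = b"PK"

print(find_aligned(rb"PK", bytes(sectors), 0, 512))  # [(512, 514)]
```

Only the occurrence at offset 512 is reported; the occurrence at offset 700 is not sector-aligned and is never tested.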

Multiple Encodings

Certain regular expression libraries known in the art with Unicode support rely on data to be decoded to Unicode symbols before consideration by a search routine; the assumption in such circumstances may be that the data to be searched is stored in a single encoding. Such an assumption may not be applicable under various circumstances, such as in digital forensics. When searching unstructured data, encodings may change capriciously between and among a variety of encodings, such as from American Standard Code for Information Interchange (ASCII) to Universal Character Set Transformation Format, 8-bit (UTF-8) and 16-bit (UTF-16), to a legacy code page.

A coded symbol set may be understood to be a list of pairs, each consisting of a symbol and a unique integer representing the symbol, which may be known as a code point. An encoding is a method for mapping sequences of code points to sequences of bytes. Unicode, for example, is a coded symbol set consisting of 1,114,112 code points, intended to be sufficient for representing all text produced by humans. UTF-8 and UTF-16 are encodings capable of representing all Unicode code points as bytes. ASCII, commonly used for simple text files (especially by English speakers), is both a coded symbol set and an encoding: the 128 code points in ASCII, numbered 0-127, are directly encoded as bytes 0-127. Numerous encodings specific to one or more natural languages have been developed, such as Shift JIS, EUC-KR, KOI8-R, and ISO 8859-1.
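The distinction between a code point and its encoded bytes can be shown in a short sketch using Python's built-in codecs. The chosen symbol and variable names are illustrative assumptions only.

```python
import sys

# Unicode code points range from 0 through 0x10FFFF inclusive.
total_code_points = sys.maxunicode + 1

symbol = "\u00e9"          # LATIN SMALL LETTER E WITH ACUTE
code_point = ord(symbol)   # the integer paired with the symbol

# One code point, three different byte sequences.
as_utf8 = symbol.encode("utf-8")
as_utf16le = symbol.encode("utf-16-le")
as_latin1 = symbol.encode("iso-8859-1")

# ASCII is both a coded symbol set and an encoding: code point == byte.
ascii_byte = "A".encode("ascii")

print(total_code_points, code_point, as_utf8, as_utf16le, as_latin1, ascii_byte)
```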

The multiplicity of encodings means that one piece of text may be represented as bytes in numerous ways. For instance, the text string “IRLIBSYR” can be encoded as in FIG. 8.

As illustrated in FIG. 8, the UTF-16LE encoding, while containing values similar to the UTF-8 encoding, is double the length of UTF-8, while the EBCDIC 37 encoding may bear little to no resemblance to UTF-16LE and UTF-8. For searching tools that are not sensitive to multiple encodings, searching a block of bytes for “IRLIBSYR” may mean searching for “IRLIBSYR” once for each possible encoding. Hundreds of encodings exist in the art, dozens of which are in common contemporary usage. As such, establishing the existence of a particular pattern within an input data for which the encoding is unknown beforehand may result in vastly expanded effort, an incomplete search, or both of these.
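As a rough illustration of the kind of comparison FIG. 8 draws, the following sketch encodes the same string with three Python codecs. The codec name `cp037` is Python's name for EBCDIC code page 37; using Python's codecs here is an assumption made for illustration.

```python
text = "IRLIBSYR"

utf8 = text.encode("utf-8")
utf16le = text.encode("utf-16-le")
ebcdic37 = text.encode("cp037")   # Python's codec for EBCDIC 37

for name, raw in (("UTF-8", utf8), ("UTF-16LE", utf16le), ("EBCDIC 37", ebcdic37)):
    print(f"{name:10} {len(raw):2d} bytes  {raw.hex(' ')}")
```

For this ASCII-range string, the UTF-16LE bytes are the UTF-8 bytes interleaved with zero bytes, which is why the values look similar at twice the length, whereas the EBCDIC bytes share no values with either.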

The coding module 210, however, may utilize multiple pattern encoding, resulting in searching in parallel for one pattern rendered in multiple encodings. In various examples, the search tool utilizes the same methodologies for parallel searching of multiple patterns on a single pass of the input data. Hence, in an example, a search for the pattern IRLIBSYR.*?SACPOP in UTF-8, UTF-16LE, UTF-16BE, UTF-32LE, and UTF-32BE does not utilize five passes over the input data (once for each encoding), but merely adds five patterns to the pattern set from which the search automaton 300 is built.

FIG. 5 is an illustration of a code point-code point and byte-byte transformation chain 500. As illustrated, the sequence 502 includes code points 504 and bytes 506 transformed in sequence. Symbol encodings may be understood to be a special case of more general transformations that map code points 504 or sequences of code points 504 to code points 504 or sequences of code points 504, and sequences of bytes 506 to sequences of bytes 506. Such transformations may be chained together, as shown in the sequence 502, such as to produce a transformation 508 from code points 504 to bytes 506. Consequently, the search tool 100 may be sensitive to text which is both encoded and transformed, such as according to a cipher. The byte 506A may represent the transition from code points 504 to bytes 506.

As a result, various examples of the coding module 210 provide for user specification of transformation chains 500. The transformation chain 500 may permit specifying multiple encodings for each pattern. Thus, for instance, a user, by way of the user interface 106 and the coding module 210, may specify the transformation chain UTF-8|OCE to cause each byte in a sequence searched for to first be UTF-8-encoded, then subjected to Outlook Compressible Encryption (OCE) without the user specifically or previously acting on the pattern.
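A transformation chain of this kind can be sketched as function composition. In the sketch below, `xor_cipher` is a hypothetical byte-to-byte transformation standing in for a cipher such as OCE (whose actual substitution table is not reproduced here), and `chain` is an assumed helper name.

```python
from typing import Callable

Transform = Callable[[bytes], bytes]

def xor_cipher(key: int) -> Transform:
    """Hypothetical byte-to-byte stage; a stand-in for a real cipher."""
    return lambda data: bytes(b ^ key for b in data)

def chain(encoding: str, *byte_stages: Transform) -> Callable[[str], bytes]:
    """Compose one code-point-to-byte encoding with byte-to-byte stages."""
    def apply(text: str) -> bytes:
        data = text.encode(encoding)   # code points -> bytes
        for stage in byte_stages:      # bytes -> bytes, applied in order
            data = stage(data)
        return data
    return apply

# Analogous in shape to a chain such as UTF-8|OCE, with XOR as the stand-in.
utf8_then_xor = chain("utf-8", xor_cipher(0x5A))
needle = utf8_then_xor("IRLIBSYR")

# Obscured input data containing the term.
haystack = b"\x00\x01" + xor_cipher(0x5A)(b"...IRLIBSYR...") + b"\x02"
print(needle in haystack)
```

The user supplies only the plain term and the chain; the transformed byte sequence that is actually searched for is derived automatically.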

In various examples, the search tool 100 generally, and the coding module 210 specifically, are explicitly byte-oriented. In order to search for alternate encodings of a pattern, the various binary representations may be generated as separate patterns in the automaton 300. Matches can then be resolved back to the user-specified term and appropriate encoding using a table.
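The following minimal sketch shows one way a table could resolve a match on a generated byte pattern back to the user-specified term and encoding. It combines the encoded variants into a single Python `re` alternation with named groups; the group names and table layout are assumptions, and Python's backtracking matcher is only a stand-in for the automaton 300.

```python
import re

term = "IRLIBSYR"
encodings = ["utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"]

table = {}          # generated pattern name -> (user term, encoding)
alternatives = []
for idx, enc in enumerate(encodings):
    name = f"p{idx}"
    table[name] = (term, enc)
    encoded = re.escape(term.encode(enc))          # byte-level literal
    alternatives.append(b"(?P<" + name.encode("ascii") + b">" + encoded + b")")

combined = re.compile(b"|".join(alternatives))

def search(data: bytes):
    """Return (offset, term, encoding) for each hit in one scan of data."""
    return [(m.start(), *table[m.lastgroup]) for m in combined.finditer(data)]

data = b"\xff" + term.encode("utf-16-le") + b"--" + term.encode("utf-8")
print(search(data))
```

One user term thus contributes five byte patterns to the pattern set, and each reported hit carries the encoding under which it was found.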

In various examples, the search tool 100 may search for ASCII-specified patterns as ASCII and as UTF-16. In addition to specifying the particular encodings to be used for a given search term, users may, in various examples, choose an automatic mode, where the symbols of a keyword are considered as Unicode code points. Unique binary representations, such as all related unique binary representations, may then be generated from the list of supported International Components for Unicode (ICU) encodings, such as in aid of searches for foreign-language keywords.
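An automatic mode of this kind might be sketched as follows. Python's codec names stand in for the ICU encoding list, which is an assumption; the helper `unique_representations` is hypothetical. Encodings that cannot represent the keyword are skipped, and encodings that yield identical bytes are merged so each byte sequence is searched for only once.

```python
def unique_representations(keyword: str, encodings):
    """Map each distinct byte sequence to the encodings that produce it."""
    reps = {}
    for enc in encodings:
        try:
            raw = keyword.encode(enc)
        except UnicodeEncodeError:
            continue   # keyword is not representable in this encoding
        reps.setdefault(raw, []).append(enc)
    return reps

# Stand-in for a list of supported encodings (an assumption).
candidates = ["ascii", "utf-8", "iso-8859-1", "koi8-r", "utf-16-le"]

ascii_reps = unique_representations("PK", candidates)
cyrillic_reps = unique_representations("\u043f\u043e\u0438\u0441\u043a", candidates)

print(len(ascii_reps), len(cyrillic_reps))
```

For the ASCII keyword, four of the five encodings collapse into one byte sequence; for the Cyrillic keyword, ASCII and ISO 8859-1 drop out and the remaining three encodings each yield a distinct sequence.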

FIG. 6 is a flowchart for searching input data including symbols. The flowchart is discussed with particularity to the search tool 100 and the system 200, though it is to be understood that the flowchart may be implemented with respect to any suitable search tool and/or system. Further, the search tool 100 and system 200 may be utilized according to any of a variety of alternative flowcharts and related methods.

At 600, regular expression patterns of symbols are received, such as by the pattern input module 202. In an example, the pattern as input corresponds to a first encoding. In an example, at least one of the patterns comprises a string of at least one symbol and at least one operator, wherein the operator specifies variable combinations of symbols within the at least one of the patterns.

At 602, another pattern is generated by the coding module 210 according to the pattern as input at 600 and a second encoding different from the first encoding. In an example, a plurality of encodings includes the first and second encodings, and a plurality of patterns is generated, the plurality of patterns corresponding to each encoding of the plurality of encodings not corresponding to the encoding of the pattern as input.

At 604, an automaton 300, 400 corresponding to the patterns is generated by the pattern analyzer 208 based on common symbols between the regular expression patterns.
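One simple way to picture an automaton built from common symbols is a trie in which patterns that share a prefix share states. The sketch below handles literal patterns only and uses a hypothetical `build_trie` helper; it is an illustrative assumption and not necessarily the construction performed by the pattern analyzer 208.

```python
def build_trie(patterns):
    """Merge literal byte patterns so that shared prefixes share states."""
    root = {"next": {}, "accept": []}
    state_count = 1
    for pattern_id, pattern in enumerate(patterns):
        node = root
        for byte in pattern:
            if byte not in node["next"]:
                node["next"][byte] = {"next": {}, "accept": []}
                state_count += 1
            node = node["next"][byte]
        node["accept"].append(pattern_id)   # pattern ends at this state
    return root, state_count

patterns = [b"search", b"seal", b"sea"]
root, states = build_trie(patterns)

# Without sharing, one state per byte plus a root would be needed.
unshared = 1 + sum(len(p) for p in patterns)
print(states, unshared)
```

Here the three patterns share the states for "sea", so the merged structure needs 8 states rather than 14.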

At 606, one or more symbols of the input data are accessed, such as by the interpreter module 206. The symbols of the input data may have been obtained by the data input module 204.

At 608, at least one thread is compared against the symbol with the interpreter module 206, the at least one thread being based on at least one of the patterns and each of the patterns corresponding to at least one of the threads. In an example, the threads are compared based on the patterns as received by the pattern input module 202 and as generated by the coding module 210. The at least one thread is compared against the symbol prior to the at least one thread being compared against a subsequent symbol of the input data. In various examples, each symbol of the input data is compared against the plurality of threads once before any one symbol is compared against the plurality of threads more than once. In an example, each symbol of the input data is compared against the plurality of threads only once.

In an example, the thread comprises a plurality of discrete instructions and the at least one thread is compared against the symbol by implementing individual ones of the discrete instructions. In an example, each of the patterns corresponds to a common automaton 300, 400 comprising a plurality of commands, and each of the threads is generated based on the automaton 300, 400. In an example, the plurality of commands of the automaton produces the patterns that correspond to the automaton 300, 400.

At 610, the interpreter module 206 and/or the data input module 204 determine if the accessing of the input data is complete. If not, the interpreter module 206 returns to operation 606 and accesses a subsequent symbol.

At 612, if the accessing of the input data is complete, the output module 212 outputs an indication of ones of the patterns determined to be contained within the input data based on the comparison of the corresponding at least one thread to the symbols of the input data. The indication may be an indication that one or more of the patterns has or has not been identified in the input data, which patterns have been identified, where, in what encoding, what data in the input data may be in proximity of the pattern, and so forth.
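The loop of operations 606 through 612 can be sketched as a small lockstep interpreter in which every live thread is stepped against the current symbol before the next symbol is read. The instruction names (`fork`, `byte`, `match`), the hand-assembled program, and the function names are all hypothetical; this is a minimal sketch under those assumptions rather than the interpreter module 206 itself.

```python
from typing import List, Tuple

# Hand-assembled program for two literal patterns: "ab" (id 0), "ac" (id 1).
PROGRAM = [
    ("fork", 1, 4),       # spawn one thread per pattern
    ("byte", ord("a")),
    ("byte", ord("b")),
    ("match", 0),
    ("byte", ord("a")),
    ("byte", ord("c")),
    ("match", 1),
]

def add_thread(prog, threads, seen, pc, start):
    """Follow fork instructions without consuming input; queue the rest."""
    stack = [pc]
    while stack:
        pc = stack.pop()
        if pc in seen:
            continue              # at most one thread per instruction
        seen.add(pc)
        op = prog[pc]
        if op[0] == "fork":
            stack.append(op[2])
            stack.append(op[1])   # popped first, so it gets priority
        else:
            threads.append((pc, start))

def search(prog, data: bytes) -> List[Tuple[int, int, int]]:
    """Return (pattern id, start, end) hits, reading each symbol once."""
    hits, current = [], []
    for i in range(len(data) + 1):
        seen = {pc for pc, _ in current}
        add_thread(prog, current, seen, 0, i)   # a match may start here
        following, following_seen = [], set()
        for pc, start in current:
            op = prog[pc]
            if op[0] == "match":
                hits.append((op[1], start, i))
            elif i < len(data) and op[0] == "byte" and op[1] == data[i]:
                add_thread(prog, following, following_seen, pc + 1, start)
        current = following
    return hits

print(search(PROGRAM, b"xabac"))  # [(0, 1, 3), (1, 3, 5)]
```

Because the thread list is rebuilt for each position, the input is traversed front to back exactly once regardless of how many patterns the program encodes.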

FIG. 7 is a block diagram illustrating components of a machine 700, according to some examples, able to read instructions from a machine-readable medium (e.g., a machine-readable storage medium) and perform any one or more of the methodologies discussed herein. Specifically, FIG. 7 shows a diagrammatic representation of the machine 700 in the example form of a computer system and within which instructions 724 (e.g., software) for causing the machine 700 to perform any one or more of the methodologies discussed herein may be executed. In alternative examples, the machine 700 operates as a standalone device or may be connected (e.g., networked) to other machines. In a networked deployment, the machine 700 may operate in the capacity of a server machine or a client machine in a server-client network environment, or as a peer machine in a peer-to-peer (or distributed) network environment. The machine 700 may be a server computer, a client computer, a personal computer (PC), a tablet computer, a laptop computer, a netbook, a set-top box (STB), a personal digital assistant (PDA), a cellular telephone, a smartphone, a web appliance, a network router, a network switch, a network bridge, or any machine capable of executing the instructions 724, sequentially or otherwise, that specify actions to be taken by that machine. Further, while only a single machine is illustrated, the term “machine” shall also be taken to include a collection of machines that individually or jointly execute the instructions 724 to perform any one or more of the methodologies discussed herein.

The machine 700 includes a processor 702 (e.g., a central processing unit (CPU), a graphics processing unit (GPU), a digital signal processor (DSP), an application specific integrated circuit (ASIC), a radio-frequency integrated circuit (RFIC), or any suitable combination thereof), a main memory 704, and a static memory 706, which are configured to communicate with each other via a bus 708. The machine 700 may further include a graphics display 710 (e.g., a plasma display panel (PDP), a light emitting diode (LED) display, a liquid crystal display (LCD), a projector, or a cathode ray tube (CRT)). The machine 700 may also include an alphanumeric input device 712 (e.g., a keyboard), a cursor control device 714 (e.g., a mouse, a touchpad, a trackball, a joystick, a motion sensor, or other pointing instrument), a storage unit 716, a signal generation device 718 (e.g., a speaker), and a network interface device 720.

The storage unit 716 includes a machine-readable medium 722 on which is stored the instructions 724 (e.g., software) embodying any one or more of the methodologies or functions described herein. The instructions 724 may also reside, completely or at least partially, within the main memory 704, within the processor 702 (e.g., within the processor's cache memory), or both, during execution thereof by the machine 700. Accordingly, the main memory 704 and the processor 702 may be considered as machine-readable media. The instructions 724 may be transmitted or received over a network 726 via the network interface device 720.

As used herein, the term “memory” refers to a machine-readable medium able to store data temporarily or permanently and may be taken to include, but not be limited to, random-access memory (RAM), read-only memory (ROM), buffer memory, flash memory, and cache memory. While the machine-readable medium 722 is shown in an example to be a single medium, the term “machine-readable medium” should be taken to include a single medium or multiple media (e.g., a centralized or distributed database, or associated caches and servers) able to store instructions. The term “machine-readable medium” shall also be taken to include any medium, or combination of multiple media, that is capable of storing instructions (e.g., software) for execution by a machine (e.g., machine 700), such that the instructions, when executed by one or more processors of the machine (e.g., processor 702), cause the machine to perform any one or more of the methodologies described herein. Accordingly, a “machine-readable medium” refers to a single storage apparatus or device, as well as “cloud-based” storage systems or storage networks that include multiple storage apparatus or devices. The term “machine-readable medium” shall accordingly be taken to include, but not be limited to, one or more data repositories in the form of a solid-state memory, an optical medium, a magnetic medium, or any suitable combination thereof.

Throughout this specification, plural instances may implement components, operations, or structures described as a single instance. Although individual operations of one or more methods are illustrated and described as separate operations, one or more of the individual operations may be performed concurrently, and nothing requires that the operations be performed in the order illustrated. Structures and functionality presented as separate components in example configurations may be implemented as a combined structure or component. Similarly, structures and functionality presented as a single component may be implemented as separate components. These and other variations, modifications, additions, and improvements fall within the scope of the subject matter herein.

Certain embodiments are described herein as including logic or a number of components, modules, or mechanisms. Modules may constitute either software modules (e.g., code embodied on a machine-readable medium or in a transmission signal) or hardware modules. A “hardware module” is a tangible unit capable of performing certain operations and may be configured or arranged in a certain physical manner. In various example embodiments, one or more computer systems (e.g., a standalone computer system, a client computer system, or a server computer system) or one or more hardware modules of a computer system (e.g., a processor or a group of processors) may be configured by software (e.g., an application or application portion) as a hardware module that operates to perform certain operations as described herein.

In some embodiments, a hardware module may be implemented mechanically, electronically, or any suitable combination thereof. For example, a hardware module may include dedicated circuitry or logic that is permanently configured to perform certain operations. For example, a hardware module may be a special-purpose processor, such as a field programmable gate array (FPGA) or an ASIC. A hardware module may also include programmable logic or circuitry that is temporarily configured by software to perform certain operations. For example, a hardware module may include software encompassed within a general-purpose processor or other programmable processor. It will be appreciated that the decision to implement a hardware module mechanically, in dedicated and permanently configured circuitry, or in temporarily configured circuitry (e.g., configured by software) may be driven by cost and time considerations.

Accordingly, the phrase “hardware module” should be understood to encompass a tangible entity, be that an entity that is physically constructed, permanently configured (e.g., hardwired), or temporarily configured (e.g., programmed) to operate in a certain manner or to perform certain operations described herein. As used herein, “hardware-implemented module” refers to a hardware module. Considering embodiments in which hardware modules are temporarily configured (e.g., programmed), each of the hardware modules need not be configured or instantiated at any one instance in time. For example, where a hardware module comprises a general-purpose processor configured by software to become a special-purpose processor, the general-purpose processor may be configured as respectively different special-purpose processors (e.g., comprising different hardware modules) at different times. Software may accordingly configure a processor, for example, to constitute a particular hardware module at one instance of time and to constitute a different hardware module at a different instance of time.

Hardware modules can provide information to, and receive information from, other hardware modules. Accordingly, the described hardware modules may be regarded as being communicatively coupled. Where multiple hardware modules exist contemporaneously, communications may be achieved through signal transmission (e.g., over appropriate circuits and buses) between or among two or more of the hardware modules. In embodiments in which multiple hardware modules are configured or instantiated at different times, communications between such hardware modules may be achieved, for example, through the storage and retrieval of information in memory structures to which the multiple hardware modules have access. For example, one hardware module may perform an operation and store the output of that operation in a memory device to which it is communicatively coupled. A further hardware module may then, at a later time, access the memory device to retrieve and process the stored output. Hardware modules may also initiate communications with input or output devices, and can operate on a resource (e.g., a collection of information).

The various operations of example methods described herein may be performed, at least partially, by one or more processors that are temporarily configured (e.g., by software) or permanently configured to perform the relevant operations. Whether temporarily or permanently configured, such processors may constitute processor-implemented modules that operate to perform one or more operations or functions described herein. As used herein, “processor-implemented module” refers to a hardware module implemented using one or more processors.

Similarly, the methods described herein may be at least partially processor-implemented, a processor being an example of hardware. For example, at least some of the operations of a method may be performed by one or more processors or processor-implemented modules. Moreover, the one or more processors may also operate to support performance of the relevant operations in a “cloud computing” environment or as a “software as a service” (SaaS). For example, at least some of the operations may be performed by a group of computers (as examples of machines including processors), with these operations being accessible via a network (e.g., the Internet) and via one or more appropriate interfaces (e.g., an application program interface (API)).

The performance of certain of the operations may be distributed among the one or more processors, not only residing within a single machine, but deployed across a number of machines. In some example embodiments, the one or more processors or processor-implemented modules may be located in a single geographic location (e.g., within a home environment, an office environment, or a server farm). In other example embodiments, the one or more processors or processor-implemented modules may be distributed across a number of geographic locations.

Additional Examples

Example 1 may include subject matter (such as an apparatus, a method, a means for performing acts) that can include a system configured to search input data including symbols. The system may include a pattern input module, configured to receive regular expression patterns of symbols. An interpreter module may be configured to access individual ones of the symbols of the input data and, upon accessing each symbol, compare at least one thread against the symbol, the at least one thread being based on at least one of the patterns and each of the patterns corresponding to at least one of the threads. For each pattern, the at least one thread corresponding to the pattern is compared against the symbol prior to the at least one thread being compared against a subsequent symbol of the input data. An output module may be configured to output an indication of ones of the patterns determined to be contained within the input data based on the comparison of the corresponding at least one thread to the symbols of the input data.

In Example 2, the system of Example 1 can optionally further include that, for each pattern, each symbol of the input data is compared against the plurality of threads once before any one symbol is compared against the plurality of threads more than once.

In Example 3, the system of any one or more of Examples 1 and 2 can optionally further include that each symbol of the input data is compared against the plurality of threads only once.

In Example 4, the system of any one or more of Examples 1-3 can optionally further include that the thread corresponds to a sequence of commands and wherein the at least one thread is compared against the symbol by implementing individual ones of the plurality of commands.

In Example 5, the system of any one or more of Examples 1-4 can optionally further include that each of the patterns corresponds to a common automaton corresponding to a plurality of commands, the sequence of commands being a subset of the plurality of commands, and wherein each of the threads is generated based on the plurality of commands.

In Example 6, the system of any one or more of Examples 1-5 can optionally further include that the plurality of commands corresponding to the automaton produces the patterns that correspond to the automaton.

In Example 7, the system of any one or more of Examples 1-6 can optionally further include a pattern analyzer configured to generate the automaton corresponding to the patterns based on common symbols between the patterns.

In Example 8, the system of any one or more of Examples 1-7 can optionally further include that a pattern as input by the pattern input module corresponds to a first encoding, and further include a coding module configured to generate another pattern according to the pattern as input and a second encoding different from the first encoding, wherein the interpreter module is configured to compare threads based on the patterns as received by the pattern input module and as generated by the coding module.

In Example 9, the system of any one or more of Examples 1-8 can optionally further include that the coding module is configured with a plurality of encodings including the first and second encodings, and wherein the coding module is configured to generate a plurality of patterns corresponding to each encoding of the plurality of encodings not corresponding to the encoding of the pattern as input.

In Example 10, the system of any one or more of Examples 1-9 can optionally further include that at least one of the patterns comprises a string of at least one symbol and at least one operator, wherein the operator specifies variable combinations of symbols within the at least one of the patterns.

Example 11 may include subject matter (such as an apparatus, a method, a means for performing acts) that can include a method for searching input data including symbols. Regular expression patterns of symbols are received. The symbols of the input data are accessed and, upon accessing each symbol, at least one thread is compared against the symbol, the at least one thread being based on at least one of the patterns and each of the patterns corresponding to at least one of the threads. The at least one thread is compared against the symbol prior to the at least one thread being compared against a subsequent symbol of the input data. An indication of ones of the patterns determined to be contained within the input data is outputted based on the comparison of the corresponding at least one thread to the symbols of the input data.

In Example 12, the method of Example 11 can optionally further include that, for each pattern, each symbol of the input data is compared against the plurality of threads once before any one symbol is compared against the plurality of threads more than once.

In Example 13, the method of any one or more of Examples 11 and 12 can optionally further include that each symbol of the input data is compared against the plurality of threads only once.

In Example 14, the method of any one or more of Examples 11-13 can optionally further include that the thread corresponds to a sequence of commands and wherein the at least one thread is compared against the symbol by implementing individual ones of the plurality of commands.

In Example 15, the method of any one or more of Examples 11-14 can optionally further include that each of the patterns corresponds to a common automaton corresponding to a plurality of commands, the sequence of commands being a subset of the plurality of commands, and wherein each of the threads is generated based on the plurality of commands.

In Example 16, the method of any one or more of Examples 11-15 can optionally further include that the plurality of commands corresponding to the automaton produces the patterns that correspond to the automaton.

In Example 17, the method of any one or more of Examples 11-16 can optionally further include generating the automaton corresponding to the patterns based on common symbols between the patterns.

In Example 18, the method of any one or more of Examples 11-17 can optionally further include that a pattern as input by the pattern input module corresponds to a first encoding, and further including generating another pattern according to the pattern as input and a second encoding different from the first encoding, wherein comparing threads is based on the patterns as received and as generated.

In Example 19, the method of any one or more of Examples 11-18 can optionally further include that the coding module is configured with a plurality of encodings including the first and second encodings, and wherein the coding module is configured to generate a plurality of patterns corresponding to each encoding of the plurality of encodings not corresponding to the encoding of the pattern as input.

In Example 20, the method of any one or more of Examples 11-19 can optionally further include that at least one of the patterns comprises a string of at least one symbol and at least one operator, wherein the operator specifies variable combinations of symbols within the at least one of the patterns.

Each of these non-limiting examples can stand on its own, or can be combined with one or more of the other examples in any permutation or combination.

The above detailed description includes references to the accompanying drawings, which form a part of the detailed description. The drawings show, by way of illustration, specific embodiments in which the invention can be practiced. These embodiments are also referred to herein as “examples.” Such examples can include elements in addition to those shown or described. However, the present inventors also contemplate examples in which only those elements shown or described are provided. Moreover, the present inventors also contemplate examples using any combination or permutation of those elements shown or described (or one or more aspects thereof), either with respect to a particular example (or one or more aspects thereof), or with respect to other examples (or one or more aspects thereof) shown or described herein.

In this document, the terms “a” or “an” are used, as is common in patent documents, to include one or more than one, independent of any other instances or usages of “at least one” or “one or more.” In this document, the term “or” is used to refer to a nonexclusive or, such that “A or B” includes “A but not B,” “B but not A,” and “A and B,” unless otherwise indicated. In this document, the terms “including” and “in which” are used as the plain-English equivalents of the respective terms “comprising” and “wherein.” Also, in the following claims, the terms “including” and “comprising” are open-ended, that is, a system, device, article, composition, formulation, or process that includes elements in addition to those listed after such a term in a claim is still deemed to fall within the scope of that claim. Moreover, in the following claims, the terms “first,” “second,” and “third,” etc. are used merely as labels, and are not intended to impose numerical requirements on their objects.

The above description is intended to be illustrative, and not restrictive. For example, the above-described examples (or one or more aspects thereof) may be used in combination with each other. Other embodiments can be used, such as by one of ordinary skill in the art upon reviewing the above description. The Abstract is provided to comply with 37 C.F.R. § 1.72(b), to allow the reader to quickly ascertain the nature of the technical disclosure. It is submitted with the understanding that it will not be used to interpret or limit the scope or meaning of the claims. Also, in the above Detailed Description, various features may be grouped together to streamline the disclosure. This should not be interpreted as intending that an unclaimed disclosed feature is essential to any claim. Rather, inventive subject matter may lie in less than all features of a particular disclosed embodiment. Thus, the following claims are hereby incorporated into the Detailed Description, with each claim standing on its own as a separate embodiment, and it is contemplated that such embodiments can be combined with each other in various combinations or permutations. The scope of the invention should be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled.

What is claimed is:
 1. A system configured to search input data including symbols, comprising: a pattern input module, configured to receive regular expression patterns of symbols; an interpreter module, configured to access individual ones of the symbols of the input data and, upon accessing each symbol, compare at least one thread against the symbol, the at least one thread being based on at least one of the patterns and each of the patterns corresponding to at least one of the threads; wherein, for each pattern, the at least one thread corresponding to the pattern is compared against the symbol prior to the at least one thread being compared against a subsequent symbol of the input data; and an output module configured to output an indication of ones of the patterns determined to be contained within the input data based on the comparison of the corresponding at least one thread to the symbols of the input data.